readability metric
Evaluating the Evaluators: Are readability metrics good measures of readability?
Cachola, Isabel, Khashabi, Daniel, Dredze, Mark
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
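To make the paper's core comparison concrete, here is a minimal sketch of the kind of check it performs: score texts with FKGL and correlate the scores against human readability judgments. The summaries and ratings below are invented for illustration (the paper evaluates 8 metrics on real PLS data); `textstat` and `scipy` are assumed as convenient stand-ins for whatever tooling the authors used.

```python
# Minimal sketch: FKGL scores vs. human readability ratings.
# Summaries and ratings are hypothetical, not the paper's data.
import textstat
from scipy.stats import pearsonr

summaries = [
    "The heart pumps blood to every part of the body.",
    "Cells use oxygen to release energy from food.",
    "Myocardial contractility modulates systemic hemodynamic perfusion pressure.",
    "Mitochondrial oxidative phosphorylation sustains cellular bioenergetics.",
]
human_ratings = [5.0, 4.5, 1.5, 1.0]  # hypothetical: higher = easier to read

fkgl = [textstat.flesch_kincaid_grade(s) for s in summaries]
# FKGL is a grade level (higher = harder), so a strong *negative*
# correlation with the ratings would indicate good agreement.
r, p = pearsonr(fkgl, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```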
Exploring the change in scientific readability following the release of ChatGPT
The rise and growing popularity of accessible large language models have raised questions about their impact on various aspects of life, including how scientists write and publish their research. The primary objective of this paper is to analyze a dataset consisting of all abstracts posted on arXiv.org between 2010 and June 7th, 2024, to assess the evolution of their readability and determine whether significant shifts occurred following the release of ChatGPT in November 2022. Four standard readability formulas are used to calculate individual readability scores for each paper, classifying their level of readability. These scores are then aggregated by year and across the eight primary categories covered by the platform. The results show a steady annual decrease in readability, suggesting that abstracts are likely becoming increasingly complex. Additionally, following the release of ChatGPT, a significant change in readability is observed for 2023 and the analyzed months of 2024. Similar trends are found across categories, with most experiencing a notable change in readability during 2023 and 2024. These findings offer insights into the broader changes in readability and point to the likely influence of AI on scientific writing.
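A rough sketch of this style of trend analysis, using one standard formula (Flesch Reading Ease via `textstat`, an assumed stand-in) and aggregating by year with pandas. The records are made up; the paper scores every arXiv abstract from 2010 through mid-2024 with four formulas across eight categories.

```python
# Sketch: aggregate a readability formula by publication year.
# Hypothetical records; the paper uses all arXiv abstracts, 2010-2024.
import pandas as pd
import textstat

records = [
    {"year": 2015, "abstract": "We study simple graphs and give a short proof."},
    {"year": 2015, "abstract": "A new bound for sorting networks is shown."},
    {"year": 2023, "abstract": "We propose a parameter-efficient self-distillation "
                               "framework leveraging heterogeneous multimodal representations."},
]
df = pd.DataFrame(records)
df["flesch"] = df["abstract"].apply(textstat.flesch_reading_ease)
print(df.groupby("year")["flesch"].mean())  # lower score = harder text
```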
Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring
Almasi, Mina, Kristensen-McLachlan, Ross Deans
This paper investigates the potential of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student's competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts - a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs as personalized, proficiency-aligned adaptive tutors and offer a scalable method for low-cost evaluation of model performance without human participants.
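A sketch of the dialogue-simulation loop the abstract describes: one model alternates between tutor and student roles, each side keeping its own chat history. Everything here is an assumption for illustration - `chat()` is a placeholder for any instruction-tuned 7B-12B model call, and the prompts are not the paper's.

```python
# Sketch of CEFR-prompted tutor/student simulation with separate histories.
TUTOR_SYSTEM = (
    "You are a Spanish tutor. Use only vocabulary and grammar "
    "appropriate for CEFR level A1."  # level varied per experiment
)
STUDENT_SYSTEM = "You are a beginner student of Spanish. Reply briefly in Spanish."

def chat(history):
    """Placeholder: swap in a call to any instruction-tuned LLM."""
    return "Respuesta de ejemplo."

def simulate_dialogue(turns=10):
    tutor_hist = [{"role": "system", "content": TUTOR_SYSTEM}]
    student_hist = [{"role": "system", "content": STUDENT_SYSTEM}]
    message = "Hola, ¿podemos practicar español?"
    transcript = []
    for _ in range(turns):
        # Tutor sees the student's last message in its own history only.
        tutor_hist.append({"role": "user", "content": message})
        tutor_reply = chat(tutor_hist)
        tutor_hist.append({"role": "assistant", "content": tutor_reply})
        transcript.append(("tutor", tutor_reply))
        # Student keeps a separate history, so the roles never leak.
        student_hist.append({"role": "user", "content": tutor_reply})
        message = chat(student_hist)
        student_hist.append({"role": "assistant", "content": message})
        transcript.append(("student", message))
    return transcript  # tutor turns are then scored for CEFR alignment
```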
The Use of Readability Metrics in Legal Text: A Systematic Literature Review
Han, Yu, Ceross, Aaron, Bergmann, Jeroen H. M.
Understanding the text in legal documents can be challenging due to their complex structure and the inclusion of domain-specific jargon. Laws and regulations are often crafted in such a manner that engagement with them requires formal training, potentially leading to vastly different interpretations of the same texts. Linguistic complexity is an important contributor to the difficulties experienced by readers. Simplifying texts could enhance comprehension across a broader audience, not just among trained professionals. Various metrics have been developed to measure document readability. Therefore, we adopted a systematic review approach to examine the linguistic and readability metrics currently employed for legal and regulatory texts. A total of 3566 initial papers were screened, with 34 relevant studies found and further assessed. Our primary objective was to identify which current metrics were applied for evaluating readability within the legal field. Sixteen different metrics were identified, with the Flesch-Kincaid Grade Level being the most frequently used method. The majority of studies (73.5%) were found in the domain of "informed consent forms". From the analysis, it is clear that not all legal domains are well represented in terms of readability metrics and that there is a further need to develop more consensus on which metrics should be applied for legal documents.
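For reference, the Flesch-Kincaid Grade Level that dominates this literature is a fixed linear formula over surface statistics:

```latex
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
              + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
```

Its output is a U.S. school grade level, which may help explain its prevalence in the informed-consent literature, where target grade levels are commonly recommended.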
MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain
Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessible. In this paper, we present a systematic study on fine-grained readability measurements in the medical domain at both sentence-level and span-level. We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences, featuring two novel "Google-Easy" and "Google-Hard" categories. It supports our quantitative analysis, which covers 650 linguistic features and automatic complex word and jargon identification. Enabled by our high-quality annotation, we benchmark and improve several state-of-the-art sentence-level readability metrics for the medical domain specifically, which include unsupervised, supervised, and prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation, we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments. We will publicly release the dataset and code.
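A minimal sketch of the jargon-feature idea: fit a linear model over an existing formula's score plus the number of annotated jargon spans, and check whether the extra feature earns weight. The numbers below are hypothetical; MedReadMe supplies real ratings and span annotations for 4,520 sentences.

```python
# Sketch: augment a readability formula with a jargon-span count.
# All values are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

fkgl = np.array([6.2, 11.5, 9.8, 14.1])   # existing formula's scores
n_jargon = np.array([0, 3, 1, 5])         # annotated jargon spans per sentence
human = np.array([1.5, 4.0, 2.5, 5.0])    # difficulty ratings (higher = harder)

X = np.column_stack([fkgl, n_jargon])
model = LinearRegression().fit(X, human)
print(model.coef_)  # a non-trivial weight on n_jargon supports the finding
```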
Know Your Audience: Do LLMs Adapt to Different Age and Education Levels?
Rooein, Donya, Curry, Amanda Cercas, Hovy, Dirk
Large language models (LLMs) offer a range of new possibilities, including adapting the text to different audiences and their reading needs. But how well do they adapt? We evaluate the readability of answers generated by four state-of-the-art LLMs (commercial and open-source) to science questions when prompted to target different age groups and education levels. To assess the adaptability of LLMs to diverse audiences, we compare the readability scores of the generated responses against the recommended comprehension level of each age and education group. We find large variations in the readability of the answers by different LLMs. Our results suggest LLM answers need to be better adapted to the intended audience demographics to be more comprehensible. They underline the importance of enhancing the adaptability of LLMs in education settings to cater to diverse age and education levels. Overall, current LLMs have set readability ranges and do not adapt well to different audiences, even when prompted. That limits their potential for educational purposes.
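A sketch of the evaluation loop the abstract describes: prompt a model for each audience, score the answer with a readability formula, and compare against the recommended comprehension level. `generate()` is a placeholder for any of the evaluated LLMs, and the grade-target mapping is illustrative, not the paper's exact table.

```python
# Sketch: does the generated answer hit the audience's reading level?
import textstat

TARGET_GRADE = {"a 10-year-old": 5, "a 15-year-old": 10, "a university student": 13}

def generate(prompt):
    """Placeholder: swap in a real LLM call here."""
    return "The sky looks blue because air scatters blue light more than red light."

question = "Why is the sky blue?"
for audience, target in TARGET_GRADE.items():
    answer = generate(f"Explain to {audience}: {question}")
    grade = textstat.flesch_kincaid_grade(answer)
    print(f"{audience}: FKGL {grade:.1f} (target {target}, gap {grade - target:+.1f})")
```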
LC-Score: Reference-less estimation of Text Comprehension Difficulty
Tardy, Paul, Roze, Charlotte, Poupet, Paul
Being able to read and understand written text is critical in a digital era. However, studies show that a large fraction of the population experiences comprehension issues. In this context, further accessibility initiatives are required to improve text comprehension among these audiences. However, writers are rarely assisted or encouraged to produce easy-to-understand content. Moreover, the development of Automatic Text Simplification (ATS) models suffers from the lack of a metric that accurately estimates comprehension difficulty. We present LC-Score, a simple approach for training a reference-less text comprehension metric for any French text, i.e., predicting how easy to understand a given text is on a [0, 100] scale. Our objective with this scale is to quantitatively capture the extent to which a text conforms to the Langage Clair (LC, "Clear Language") guidelines, a French initiative closely related to English Plain Language. We explore two approaches: (i) training statistical models on linguistically motivated indicators, and (ii) neural learning directly from text by leveraging pre-trained language models. We introduce a simple proxy task that frames comprehension difficulty training as a classification task. To evaluate our models, we run two distinct human annotation experiments and find that both approaches (indicator-based and neural) outperform commonly used readability and comprehension metrics such as FKGL.
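A sketch of approach (i): simple linguistic indicators feeding a statistical classifier, with difficulty binned into classes as a proxy task. The features, French snippets, and binary bins are all illustrative assumptions standing in for the paper's indicator set and [0, 100] scale.

```python
# Sketch: indicator-based comprehension-difficulty classifier (proxy task).
from sklearn.linear_model import LogisticRegression

def indicators(text):
    """A few crude, linguistically motivated features (illustrative only)."""
    words = text.split()
    n_sents = max(text.count(".") + text.count("!") + text.count("?"), 1)
    return [
        len(words) / n_sents,                             # mean sentence length
        sum(len(w) for w in words) / max(len(words), 1),  # mean word length
    ]

# Hypothetical snippets binned into easy (0) / hard (1) classes.
texts = [
    "Le chat dort sur le lit.",
    "Il fait beau aujourd'hui.",
    "Les dispositions susmentionnées s'appliquent nonobstant toute stipulation contraire.",
    "L'interprétation téléologique des normes exige une contextualisation approfondie.",
]
labels = [0, 0, 1, 1]
clf = LogisticRegression().fit([indicators(t) for t in texts], labels)
print(clf.predict([indicators("La jurisprudence afférente demeure incertaine.")]))
```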
TextDescriptives: A Python package for calculating a large variety of metrics from text
Hansen, Lasse, Olsen, Ludvig Renbo, Enevoldsen, Kenneth
TextDescriptives is a Python package for calculating a large variety of metrics from text. It is built on top of spaCy and can be easily integrated into existing workflows. The package has already been used for analysing the linguistic stability of clinical texts, creating features for predicting neuropsychiatric conditions, and analysing linguistic goals of primary school students. This paper describes the package and its features.
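A minimal usage sketch, assuming the package's spaCy-component API as documented for recent (v2-style) releases - component names like "textdescriptives/readability" and the extract_df helper; consult the package docs if the API has shifted.

```python
# Usage sketch for TextDescriptives via its spaCy pipeline components.
import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")  # "textdescriptives/all" for everything
doc = nlp("TextDescriptives computes a large variety of metrics from text.")
print(td.extract_df(doc))  # one row per doc with readability columns
```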